
    Assessing and Remedying Coverage for a Given Dataset

    Data analysis impacts virtually every aspect of our society today. Often, this analysis is performed on an existing dataset, possibly collected through a process that the data scientists had limited control over. The data analyzed may not include the complete universe, but it is expected to cover the diversity of items in the universe. Lack of adequate coverage in the dataset can result in undesirable outcomes such as biased decisions and algorithmic racism, and can create vulnerabilities such as opening up room for adversarial attacks. In this paper, we assess the coverage of a given dataset over multiple categorical attributes. We first provide efficient techniques for traversing the combinatorial explosion of value combinations to identify any regions of attribute space not adequately covered by the data. Then, we determine the least amount of additional data that must be obtained to resolve this lack of adequate coverage. We confirm the value of our proposal through both theoretical analyses and comprehensive experiments on real data.
    Comment: in ICDE 201
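
    To make the coverage notion concrete, here is a minimal brute-force sketch that enumerates value combinations over categorical attributes and flags those matched by fewer than a threshold number of tuples; the paper's contribution is precisely avoiding this exhaustive traversal, and the function names, the `threshold`, and the `max_level` cap are illustrative assumptions rather than the paper's API.

```python
# Brute-force coverage check: enumerate value combinations over up to
# `max_level` categorical attributes and flag those matched by fewer than
# `threshold` tuples. Illustrative only; the paper avoids this enumeration.
from itertools import combinations, product

def uncovered_regions(rows, attributes, domains, threshold=5, max_level=2):
    """Return (value-combination, count) pairs covered by < threshold rows."""
    uncovered = []
    for level in range(1, max_level + 1):
        for attrs in combinations(attributes, level):
            for values in product(*(domains[a] for a in attrs)):
                count = sum(all(row[a] == v for a, v in zip(attrs, values))
                            for row in rows)
                if count < threshold:
                    uncovered.append((dict(zip(attrs, values)), count))
    return uncovered

rows = [{"gender": "F", "race": "white"},
        {"gender": "M", "race": "black"},
        {"gender": "M", "race": "white"}]
domains = {"gender": ["F", "M"], "race": ["white", "black", "asian"]}
print(uncovered_regions(rows, ["gender", "race"], domains, threshold=1))
```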

    Online Maximum Independent Set of Hyperrectangles

    The maximum independent set problem is a classical NP-hard problem in theoretical computer science. In this work, we study a special case in which the family of graphs is restricted to intersection graphs of sets of axis-aligned hyperrectangles and the input is provided in an online fashion. We prove bounds on the competitive ratio of an optimal online algorithm under the adaptive offline, adaptive online, and oblivious adversary models, for several classes of hyperrectangles and restrictions on the order of the input. We are the first to present results on this problem under the oblivious adversary model. We prove bounds on the competitive ratio for unit hypercubes, $\sigma$-bounded hypercubes, unit-volume hypercubes, arbitrary hypercubes, and arbitrary hyperrectangles, in both arbitrary and non-dominated order. We are also the first to present results under the adaptive offline and adaptive online adversary models with input in non-dominated order, proving bounds on the competitive ratio for the same classes of hyperrectangles; for input in arbitrary order, we present the first results on $\sigma$-bounded hypercubes, unit-volume hyperrectangles, arbitrary hypercubes, and arbitrary hyperrectangles. For input in dominating order, we show that the performance of the naive greedy algorithm matches the performance of an optimal offline algorithm in all cases. We also give lower bounds on the competitive ratio of a probabilistic greedy algorithm under the oblivious adversary model. We conclude by discussing several promising directions for future work.
    Comment: 27 pages, 12 figures
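
    The abstract mentions a naive greedy algorithm; a plausible reading is sketched below: irrevocably accept each arriving hyperrectangle if and only if it is disjoint from everything accepted so far. The (lo, hi) corner-tuple representation is an assumption for illustration.

```python
# Naive greedy online independent set of axis-aligned boxes: accept an
# arriving box iff it intersects no previously accepted box.

def intersects(a, b):
    """Axis-aligned boxes intersect iff their intervals overlap in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(al < bh and bl < ah for al, ah, bl, bh in zip(alo, ahi, blo, bhi))

def greedy_online_mis(stream):
    accepted = []
    for box in stream:
        if not any(intersects(box, other) for other in accepted):
            accepted.append(box)  # decision is irrevocable in the online model
    return accepted

# Unit squares arriving online, each given as ((x_lo, y_lo), (x_hi, y_hi)).
stream = [((0, 0), (1, 1)), ((0.5, 0.5), (1.5, 1.5)), ((2, 0), (3, 1))]
print(greedy_online_mis(stream))  # keeps the first and third squares
```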

    Data-Centric Distrust Quantification for Responsible AI: When Data-driven Outcomes Are Not Reliable

    At the same time that AI and machine learning are becoming central to human life, their potential harms are becoming more vivid. In the presence of such drawbacks, a critical question to address before using these data-driven technologies to make a decision is whether to trust their outcomes. Aligned with recent efforts on data-centric AI, this paper proposes a novel approach to address the trust question through the lens of data, by associating datasets with distrust quantifications that specify their scope of use for predicting future query points. The distrust values raise warning signals when a prediction based on a dataset is questionable, and they are valuable alongside other techniques for trustworthy AI. We propose novel algorithms for computing the distrust values in the neighborhood of a query point efficiently and effectively. Learning the necessary components of the measures from the data itself, our sub-linear algorithms scale to very large and multi-dimensional settings. Besides demonstrating the efficiency of our algorithms, our extensive experiments reflect a consistent correlation between distrust values and model performance. This underscores the message that when the distrust value of a query point is high, the prediction outcome should be discarded or at least not considered for critical decisions.
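
    The abstract does not spell out its distrust measures, so the following is only a toy stand-in for the underlying idea: a query point that falls in a sparse region of the dataset should receive a high distrust value. The choice of k and of the Euclidean metric are assumptions.

```python
# Toy distrust proxy: mean distance from the query to its k nearest training
# points, so queries in sparse regions of the data score high.
import numpy as np

def distrust(data, query, k=10):
    """Mean Euclidean distance from `query` to its k nearest rows of `data`."""
    dists = np.linalg.norm(data - query, axis=1)
    return np.sort(dists)[:k].mean()

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))        # training data concentrated near the origin
print(distrust(data, np.zeros(5)))       # low value: dense neighborhood
print(distrust(data, np.full(5, 10.0)))  # high value: far outside the data
```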

    Responsible Scoring Mechanisms Through Function Sampling

    Human decision-makers often receive assistance from data-driven algorithmic systems that provide a score for evaluating objects, including individuals. The scores are generated by a function (mechanism) that takes a set of features as input and produces a score. The scoring functions are either machine-learned or human-designed and can be used for different decision purposes such as ranking or classification. Given the potential impact of these scoring mechanisms on individuals' lives and on society, it is important to make sure these scores are computed responsibly. Hence, we need tools for responsible scoring mechanism design. In this paper, focusing on linear scoring functions, we highlight the importance of unbiased function sampling and perturbation in the function space for devising such tools. We provide unbiased samplers for the entire function space, as well as for a $\theta$-vicinity around a given function. We then illustrate the value of these samplers for designing effective algorithms in three diverse problem scenarios in the context of ranking. Finally, as a fundamental method for designing responsible scoring mechanisms, we propose a novel approach for approximating the construction of the arrangement of hyperplanes. Despite the exponential complexity of an arrangement in the number of dimensions, using function sampling, our algorithm is linear in the number of samples and hyperplanes, and independent of the number of dimensions.
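
    A minimal sketch of unbiased sampling in this setting, identifying linear scoring functions with unit weight vectors: normalizing a standard Gaussian yields a uniform point on the unit sphere, and rejection yields a uniform point in the angular $\theta$-vicinity of a given function. The paper's samplers are more sophisticated; this only illustrates the unbiasedness requirement.

```python
# Uniform sampling of linear scoring functions (unit vectors), plus a simple
# rejection sampler for the theta-vicinity of a given function.
import numpy as np

rng = np.random.default_rng(0)

def sample_function(d):
    """Uniformly random unit vector in R^d (a random linear scoring function)."""
    w = rng.normal(size=d)
    return w / np.linalg.norm(w)

def sample_vicinity(f, theta):
    """Uniform sample among unit vectors within angle theta of f (rejection)."""
    while True:
        w = sample_function(len(f))
        if np.arccos(np.clip(w @ f, -1.0, 1.0)) <= theta:
            return w

f = sample_function(3)
g = sample_vicinity(f, theta=0.3)
print(np.degrees(np.arccos(np.clip(f @ g, -1.0, 1.0))))  # at most ~17.2 degrees
```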

    Efficient Computation of Subspace Skyline over Categorical Domains

    Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed the way we search for accommodation, restaurants, and more. The underlying datasets in such applications have numerous attributes that are mostly Boolean or categorical. Discovering the skyline of such datasets over a subset of attributes identifies entries that stand out while enabling numerous applications. Only a few algorithms have been designed to compute the skyline over categorical attributes, and they are applicable only when the number of attributes is small. In this paper, we place the problem of skyline discovery over categorical attributes into perspective and design efficient algorithms for two cases. (i) In the absence of indices, we propose two algorithms, ST-S and ST-P, that exploit the categorical characteristics of the datasets, organizing tuples in a tree data structure that supports efficient dominance tests over the candidate set. (ii) We then consider the existence of widely used precomputed sorted lists. After discussing several approaches and studying their limitations, we propose TA-SKY, a novel threshold-style algorithm that utilizes sorted lists. Moreover, we further optimize TA-SKY and explore its progressive nature, making it suitable for applications with strict interactive requirements. In addition to an extensive theoretical analysis of the proposed algorithms, we conduct a comprehensive experimental evaluation on a combination of real (including the entire AirBnB data collection) and synthetic datasets to study the practicality of the proposed algorithms. The results showcase the superior performance of our techniques, outperforming applicable approaches by orders of magnitude.
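
    For reference, a baseline block-nested-loop skyline over Boolean attributes (1 preferred over 0) is sketched below, only to pin down the dominance semantics; ST-S, ST-P, and TA-SKY exist precisely to avoid the quadratic dominance testing done here. The attribute names are made up for illustration.

```python
# Block-nested-loop skyline over Boolean attributes, restricted to a subspace.

def dominates(a, b, attrs):
    """a dominates b iff a is at least as good everywhere and better somewhere."""
    return (all(a[x] >= b[x] for x in attrs)
            and any(a[x] > b[x] for x in attrs))

def skyline(tuples, attrs):
    result = []
    for t in tuples:
        if any(dominates(s, t, attrs) for s in result):
            continue                 # t is dominated, discard it
        result = [s for s in result if not dominates(t, s, attrs)]
        result.append(t)
    return result

listings = [{"wifi": 1, "parking": 0, "pool": 0},
            {"wifi": 0, "parking": 1, "pool": 1},
            {"wifi": 0, "parking": 0, "pool": 1}]
print(skyline(listings, ["wifi", "parking"]))  # skyline in a two-attribute subspace
```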

    A Fair and Memory/Time-efficient Hashmap

    There is a large body of work on constructing hashmaps to minimize the number of collisions. However, to the best of our knowledge, no known hashing technique guarantees group fairness among different groups of items. We are given a set $P$ of $n$ tuples in $\mathbb{R}^d$, for a constant dimension $d$, and a set of groups $\mathcal{G}=\{\mathbf{g}_1,\ldots,\mathbf{g}_k\}$ such that every tuple belongs to a unique group. We formally define the fair hashing problem by introducing the notions of single fairness ($\Pr[h(p)=h(x)\mid p\in\mathbf{g}_i, x\in P]$ for every $i=1,\ldots,k$), pairwise fairness ($\Pr[h(p)=h(q)\mid p,q\in\mathbf{g}_i]$ for every $i=1,\ldots,k$), and the well-known collision probability ($\Pr[h(p)=h(q)\mid p,q\in P]$). The goal is to construct a hashmap such that the collision probability, the single fairness, and the pairwise fairness are close to $1/m$, where $m$ is the number of buckets in the hashmap. We propose two families of algorithms to design fair hashmaps. First, we focus on hashmaps with optimum memory consumption, minimizing the unfairness. We model the input tuples as points in $\mathbb{R}^d$, and the goal is to find a vector $w$ such that the projection of $P$ onto $w$ creates an ordering that is convenient to split in order to create a fair hashmap. For each projection, we design efficient algorithms that find near-optimum partitions of exactly (or at most) $m$ buckets. Second, we focus on hashmaps with optimum fairness ($0$-unfairness), minimizing the memory consumption. We make the important observation that the fair hashmap problem reduces to the necklace splitting problem. By carefully implementing algorithms for solving the necklace splitting problem, we propose faster algorithms constructing hashmaps with $0$-unfairness using $2(m-1)$ boundary points when $k=2$ and $k(m-1)(4+\log_2(3mn))$ boundary points for $k>2$.
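
    The three notions above are ordinary collision probabilities, so for any concrete hashmap they can be estimated directly and compared to the $1/m$ target, as in the sketch below. The hash function used (Python's built-in hash mod $m$) is only a placeholder, not the paper's construction.

```python
# Empirical estimates of collision probability, pairwise fairness, and single
# fairness for a given bucket assignment; a fair hashmap keeps all near 1/m.
from itertools import combinations

def pairwise_collision(bucket, items):
    """Pr[h(p)=h(q)] over unordered pairs of distinct items."""
    pairs = list(combinations(items, 2))
    return sum(bucket(p) == bucket(q) for p, q in pairs) / len(pairs)

def single_fairness(bucket, group, population):
    """Pr[h(p)=h(x)] for p drawn from the group, x from the whole population."""
    pairs = [(p, x) for p in group for x in population if p != x]
    return sum(bucket(p) == bucket(x) for p, x in pairs) / len(pairs)

m = 8
bucket = lambda t: hash(t) % m
groups = {"g1": [("a", i) for i in range(100)],
          "g2": [("b", i) for i in range(50)]}
population = [t for g in groups.values() for t in g]

print("collision probability:", pairwise_collision(bucket, population), "target:", 1 / m)
for name, g in groups.items():
    print(name, "pairwise:", pairwise_collision(bucket, g),
          "single:", single_fairness(bucket, g, population))
```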

    Maximizing Neutrality in News Ordering

    The detection of fake news has received increasing attention over the past few years, but there are more subtle ways of deceiving one's audience. In addition to the content of news stories, their presentation can also be made misleading or biased. In this work, we study the impact of the ordering of news stories on audience perception. We introduce the problems of detecting cherry-picked news orderings and maximizing neutrality in news orderings. We prove hardness results and present several algorithms for approximately solving these problems. Furthermore, we provide extensive experimental results and present evidence of potential cherry-picking in the real world.
    Comment: 14 pages, 13 figures, accepted to KDD '2
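
    The abstract leaves the neutrality measure unspecified, so the following is purely a toy illustration of why ordering matters: score an ordering by a position-discounted sum of per-story stance scores (earlier slots weigh more) and flag it as potentially cherry-picked when the score is extreme relative to random permutations. The 1/(i+1) discount, the stance scores, and the 5% cutoff are all assumptions.

```python
# Toy cherry-picking detector: compare an ordering's position-discounted
# slant against a baseline of random permutations of the same stories.
import random

def slant(order, stance):
    return sum(stance[s] / (i + 1) for i, s in enumerate(order))

def looks_cherry_picked(order, stance, trials=10_000, alpha=0.05):
    observed = abs(slant(order, stance))
    baseline = sorted(abs(slant(random.sample(order, len(order)), stance))
                      for _ in range(trials))
    return observed >= baseline[int((1 - alpha) * trials) - 1]

stance = {"s1": +1.0, "s2": +0.8, "s3": -0.9, "s4": -0.7}
print(looks_cherry_picked(["s1", "s2", "s3", "s4"], stance))  # likely flags positives-first
```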